CLDR-17897 Fix unstable scripts when running GenerateLikelySubtags and ConvertLanguageData #3998

conradarcturus · 2024-08-29T22:31:00Z

While we improve the population data and likely subtags, partial data is generating side effects. This adds new scripts so we can avoid these side effects in future changes. Ultimately we will want to reduce the number of overrides here, but it's good to fix this now.

See the data updated in this diagram:

This PR completes the ticket. I'm submitting this request first to separate the changes.
This PR stabilizes the data so it's easier to follow up and fix other overrides.

Run this command to regenerate data:

mvn package -DskipTests=true && java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData && java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2024-08-29T22:34:23Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

… and likely subtag overrides The generated files for ConvertLanguageData and GenerateLikelySubtags change if input files are modified. This change seeks to stabilize the scripts' outputs. CLDR-17897 Add overrides to Likely Subtags

jira-pull-request-webhook · 2024-08-29T22:42:38Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-08-29T23:24:42Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

roozbehp · 2024-08-30T06:54:45Z

common/supplemental/likelySubtags.xml

@@ -703,10 +706,10 @@ not be patched by hand, as any changes made in that fashion may be lost.
 		<likelySubtag from="tiv" to="tiv_Latn_NG"/>		<!--Tiv‧?‧?	➡ Tiv‧Latin‧Nigeria-->
 		<likelySubtag from="tk" to="tk_Latn_TM"/>		<!--Turkmen‧?‧?	➡ Turkmen‧Latin‧Turkmenistan-->
 		<likelySubtag from="tkl" to="tkl_Latn_TK"/>		<!--Tokelau‧?‧?	➡ Tokelau‧Latin‧Tokelau-->
-		<likelySubtag from="tkr" to="tkr_Latn_AZ"/>		<!--Tsakhur‧?‧?	➡ Tsakhur‧Latin‧Azerbaijan-->
+		<likelySubtag from="tkr" to="tkr_Cyrl_AZ"/>		<!--Tsakhur‧?‧?	➡ Tsakhur‧Cyrillic‧Azerbaijan-->


This doesn't make sense. Tsakhur is written in Latin in Azerbaijan and in Cyrillic in Russia. The old value was correct.

Thanks for the close examination -- I'll re-introduce the overrides for these languages. I'm having a problem fighting the different sources of truth :p Definitely Latn should be considered the primary script in Azerbaijan.

I think the root problem is that "Cyrl" comes before "Latn" alphabetically, and when the script is re-run it now takes the first item in alphabetical order.
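
To illustrate the suspected cause, here is a minimal sketch. It is not the actual ConvertLanguageData or GenerateLikelySubtags logic; the class and method names are hypothetical, and the input list simply assumes that Latn is ranked ahead of Cyrl for Tsakhur in Azerbaijan, as the review comment above states.

```java
import java.util.List;
import java.util.TreeSet;

/** Hypothetical sketch; not the real CLDR tooling. */
class ScriptOrderSketch {
    /** Picks the first script after sorting alphabetically (the suspected behavior). */
    static String firstAlphabetical(List<String> rankedScripts) {
        return new TreeSet<>(rankedScripts).first();
    }

    /** Picks the first script in the order given, treating that order as a usage ranking. */
    static String firstRanked(List<String> rankedScripts) {
        return rankedScripts.get(0);
    }

    public static void main(String[] args) {
        // Assumed ranking for tkr in AZ: Latn first, Cyrl second.
        List<String> tkrInAz = List.of("Latn", "Cyrl");
        System.out.println(firstAlphabetical(tkrInAz)); // prints "Cyrl"
        System.out.println(firstRanked(tkrInAz));       // prints "Latn"
    }
}
```

If the generator behaves like `firstAlphabetical`, any language that lists both Cyrl and Latn would flip to Cyrl on regeneration unless an override or a primary/secondary distinction intervenes.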

roozbehp · 2024-08-30T06:58:38Z

common/supplemental/likelySubtags.xml

@@ -725,7 +728,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 		<likelySubtag from="tt" to="tt_Cyrl_RU"/>		<!--Tatar‧?‧?	➡ Tatar‧Cyrillic‧Russia-->
 		<likelySubtag from="ttj" to="ttj_Latn_UG"/>		<!--Tooro‧?‧?	➡ Tooro‧Latin‧Uganda-->
 		<likelySubtag from="tts" to="tts_Thai_TH"/>		<!--Northeastern Thai‧?‧?	➡ Northeastern Thai‧Thai‧Thailand-->
-		<likelySubtag from="ttt" to="ttt_Latn_AZ"/>		<!--Muslim Tat‧?‧?	➡ Muslim Tat‧Latin‧Azerbaijan-->
+		<likelySubtag from="ttt" to="ttt_Cyrl_AZ"/>		<!--Muslim Tat‧?‧?	➡ Muslim Tat‧Cyrillic‧Azerbaijan-->


Same problem. Muslim Tat is written in Latin in Azerbaijan and in Cyrillic in Russia.

roozbehp · 2024-08-30T07:02:34Z

common/supplemental/likelySubtags.xml

@@ -1036,6 +1039,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 		<likelySubtag from="und_Ahom" to="aho_Ahom_IN"/>		<!--?‧Ahom‧?	➡ Ahom‧Ahom‧India-->
 		<likelySubtag from="und_Arab" to="ar_Arab_EG"/>		<!--?‧Arabic‧?	➡ Arabic‧Arabic‧Egypt-->
 		<likelySubtag from="und_Arab_AF" to="fa_Arab_AF"/>		<!--?‧Arabic‧Afghanistan	➡ Persian‧Arabic‧Afghanistan-->
+		<likelySubtag from="und_Arab_AZ" to="tly_Arab_AZ"/>		<!--?‧Arabic‧Azerbaijan	➡ Talysh‧Arabic‧Azerbaijan-->


If you have something unknown in Arabic script in Azerbaijan, it's probably not Talysh (which has a pretty small community compared to Azerbaijani, where they write in Latin). It's very probably Azerbaijani in the old orthography.

srl295 · 2024-08-31T13:20:00Z

common/supplemental/supplementalData.xml

@@ -1890,7 +1890,7 @@ XXX Code for transations where no currency is involved
 		<language type="lv" scripts="Latn" territories="LV"/>
 		<language type="lwl" scripts="Thai"/>
 		<language type="lzh" scripts="Hans" alt="secondary"/>
-		<language type="lzz" scripts="Latn Geor"/>
+		<language type="lzz" scripts="Geor Latn"/>


Is the ordering significant?

The code is writing it alphabetically, so when I re-run the script it force-alphabetizes the list. There is an argument that it should be ordered by usage; however, the XML is just not a good way to capture this because the labelling is unclear.

The scripts (and regions) should be in ranked order, not sorted. If the code is sorting them, that's a bug.

It looks like if Roozbeh's items are taken care of, then this would be ready to merge into 47.

I'm glad Roozbeh took a look ;) I resolved this by changing the non-Latn script for these languages to be considered "secondary" in language_script.tsv

Merging these changes ended up getting really messy, so I'll post a new pull request.

srl295 · 2024-08-31T13:22:13Z

tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/language_script.tsv

+pnt	Pontic	secondary	Cyrl	Cyrillic
+pnt	Pontic	secondary	Latn	Latin


What's the basis of making these secondary?

Promoting Grek to be the primary script for Pontic.

Really, for all current Pontic speakers it's Grek in Greece, Latn in Turkey, and Cyrl in Russia/Ukraine. Pontic is only spoken by very marginal populations in Turkey and Russia, but it's a large recognized community in Greece.

What's the basis for primary v secondary?

The primary vs secondary should be based on the literate population sizes. I forget what the cutoff is, but clearly if >50% of the usage of the language is in a particular script, that would be primary, not secondary. (But again, there might be a bug in the code.)
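
As a rough sketch of that rule: the snippet below labels a script "primary" when its share of literate users exceeds a cutoff. The 0.5 cutoff is an assumption (the thread only says that more than 50% is clearly primary and that the real cutoff is not remembered), the class and method names are hypothetical, and the counts are invented placeholders, not real Pontic population figures.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch; not the real CLDR tooling. */
class ScriptStatusSketch {
    /** Assumed cutoff; the actual CLDR threshold is not stated in this thread. */
    static final double PRIMARY_SHARE = 0.5;

    /** Labels each script "primary" or "secondary" by its share of literate users. */
    static Map<String, String> classify(Map<String, Long> literateUsersByScript) {
        long total = literateUsersByScript.values().stream().mapToLong(Long::longValue).sum();
        Map<String, String> status = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : literateUsersByScript.entrySet()) {
            double share = total == 0 ? 0.0 : (double) e.getValue() / total;
            status.put(e.getKey(), share > PRIMARY_SHARE ? "primary" : "secondary");
        }
        return status;
    }

    public static void main(String[] args) {
        // Placeholder counts for illustration only; NOT real population data.
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("Grek", 800L);
        counts.put("Latn", 150L);
        counts.put("Cyrl", 50L);
        System.out.println(classify(counts));
        // prints {Grek=primary, Latn=secondary, Cyrl=secondary}
    }
}
```

Under these made-up numbers the result matches the two `secondary` rows added to language_script.tsv, with Grek left as the primary script.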

macchiati · 2024-09-04T00:53:33Z

tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateLikelySubtags.java

@@ -392,6 +393,9 @@ public static void main(String[] args) throws IOException {
                                {"mro", "mro_Mroo_BD"},
                                {"mro_BD", "mro_Mroo_BD"},
                                {"ms_Arab", "ms_Arab_MY"},
+                                {"nan", "nan_Hans_CN"},
+                                {"nan_Hans", "nan_Hans_CN"},


This won't hurt anything, but nan_Hans is redundant, because the algorithm will find {"nan", "nan_Hans_CN"}, and fill in.

There is a ticket open for dropping overrides that have no effect, so it is ok to keep this line for now.

Interestingly, I need to keep this line; otherwise the produced likelySubtags.xml file will show nan_Hans -> nan_Hans_TW, even though, as we know, the Hant script would be preferred in Taiwan. The problem is that we don't have population estimates on Simplified v Traditional Chinese script usage.
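
For reference, the lookup-side fill-in that makes `nan_Hans` look redundant can be sketched as below. This is a simplified assumption, not the CLDR implementation: it only tries language_script and then language, the class and method names are hypothetical, and the table holds just the one override from the diff.

```java
import java.util.Map;

/** Hypothetical sketch of fill-in at lookup time; not the real CLDR tooling. */
class LikelyFillInSketch {
    /**
     * Looks up lang_script, then lang, and fills empty input fields from the match.
     * Pass an empty string for an unknown script. Returns null if nothing matches.
     */
    static String maximize(String lang, String script, Map<String, String> table) {
        String hit = script.isEmpty() ? null : table.get(lang + "_" + script);
        if (hit == null) {
            hit = table.get(lang);
        }
        if (hit == null) {
            return null;
        }
        String[] parts = hit.split("_"); // language, script, region
        String outScript = script.isEmpty() ? parts[1] : script;
        return lang + "_" + outScript + "_" + parts[2];
    }

    public static void main(String[] args) {
        Map<String, String> table = Map.of("nan", "nan_Hans_CN");
        System.out.println(maximize("nan", "", table));     // prints "nan_Hans_CN"
        System.out.println(maximize("nan", "Hans", table)); // prints "nan_Hans_CN"
    }
}
```

This only models lookup against a finished table. The generator itself builds `nan_Hans` from its input data, which is presumably why the explicit override still changes the produced file.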

conradarcturus · 2024-09-04T04:19:01Z

Thanks everyone for the comments! It helped me make a better version of this PR in #4015.

Apologies for making a separate one -- rebasing it to the ddl/v47 branch introduced weird merge artifacts so I just made a new PR.

conradarcturus requested review from srl295 and macchiati August 29, 2024 22:31

github-actions bot assigned conradarcturus Aug 29, 2024

conradarcturus force-pushed the CLDR-17884-Add-primary-scripts branch from a77d66d to 3c0661a Compare August 29, 2024 22:34

conradarcturus force-pushed the CLDR-17884-Add-primary-scripts branch from 3c0661a to e5fa96d Compare August 29, 2024 22:42

conradarcturus added 2 commits August 29, 2024 16:22

CLDR-17897 Merging

0756612

conradarcturus force-pushed the CLDR-17884-Add-primary-scripts branch from da92bfe to 0756612 Compare August 29, 2024 23:24

roozbehp requested changes Aug 30, 2024

srl295 reviewed Aug 31, 2024

macchiati reviewed Sep 4, 2024

conradarcturus closed this Sep 4, 2024

conradarcturus deleted the CLDR-17884-Add-primary-scripts branch September 4, 2024 04:19

conradarcturus mentioned this pull request Oct 1, 2024

CLDR-17897 Make ConvertLanguageData Consistent #4015

Merged

5 tasks
